---
id: guides-building-custom-metrics
title: Building Custom LLM Metrics
sidebar_label: Building Custom Metrics
---

<head>
  <link
    rel="canonical"
    href="https://deepeval.com/guides/guides-building-custom-metrics"
  />
</head>

In `deepeval`, anyone can easily build their own custom LLM evaluation metric that is automatically integrated within `deepeval`'s ecosystem, which includes:

- Running your custom metric in **CI/CD pipelines**.
- Taking advantage of `deepeval`'s capabilities such as **metric caching and multi-processing**.
- Have custom metric results **automatically sent to Confident AI**.

Here are a few reasons why you might want to build your own LLM evaluation metric:

- **You want greater control** over the evaluation criteria used (and you think [`GEval`](#metrics-llm-evals) is insufficient).
- **You don't want to use an LLM** for evaluation (since all metrics in `deepeval` are powered by LLMs).
- **You wish to combine several `deepeval` metrics** (eg., it makes a lot of sense to have a metric that checks for both answer relevancy and faithfulness).

:::info
There are many ways one can implement an LLM evaluation metric. Here is a [great article on everything you need to know about scoring LLM evaluation metrics.](https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation)
:::

## Rules To Follow When Creating A Custom Metric

### 1. Inherit the `BaseMetric` class

To begin, create a class that inherits from `deepeval`'s `BaseMetric` class:

```python
from deepeval.metrics import BaseMetric

class CustomMetric(BaseMetric):
    ...
```

This is important because the `BaseMetric` class will help `deepeval` acknowledge your custom metric during evaluation.

### 2. Implement the `__init__()` method

The `BaseMetric` class gives your custom metric a few properties that you can configure and be displayed post-evaluation, either locally or on Confident AI.

An example is the `threshold` property, which determines whether the `LLMTestCase` being evaluated has passed or not. Although **the `threshold` property is all you need to make a custom metric functional**, here are some additional properties for those who want even more customizability:

- `evaluation_model`: a `str` specifying the name of the evaluation model used.
- `include_reason`: a `bool` specifying whether to include a reason alongside the metric score. This won't be needed if you don't plan on using an LLM for evaluation.
- `strict_mode`: a `bool` specifying whether to pass the metric only if there is a perfect score.
- `async_mode`: a `bool` specifying whether to execute the metric asynchronously.

:::tip
Don't read too much into the advanced properties for now, we'll go over how they can be useful in later sections of this guide.
:::

The `__init__()` method is a great place to set these properties:

```python
from deepeval.metrics import BaseMetric

class CustomMetric(BaseMetric):
    def __init__(
        self,
        threshold: float = 0.5,
        # Optional
        evaluation_model: str,
        include_reason: bool = True,
        strict_mode: bool = True,
        async_mode: bool = True
    ):
        self.threshold = threshold
        # Optional
        self.evaluation_model = evaluation_model
        self.include_reason = include_reason
        self.strict_mode = strict_mode
        self.async_mode = async_mode
```

### 3. Implement the `measure()` and `a_measure()` methods

The `measure()` and `a_measure()` method is where all the evaluation happens. In `deepeval`, evaluation is the process of applying a metric to an `LLMTestCase` to generate a score and optionally a reason for the score (if you're using an LLM) based on the scoring algorithm.

The `a_measure()` method is simply the asynchronous implementation of the `measure()` method, and so they should both use the same scoring algorithm.

:::info
The `a_measure()` method allows `deepeval` to run your custom metric asynchronously. Take the `assert_test` function for example:

```python
from deepeval import assert_test

def test_multiple_metrics():
    ...
    assert_test(test_case, [metric1, metric2], run_async=True)
```

When you run `assert_test()` with `run_async=True` (which is the default behavior), `deepeval` calls the `a_measure()` method which allows all metrics to run concurrently in a non-blocking way.
:::

Both `measure()` and `a_measure()` **MUST**:

- accept an `LLMTestCase` as argument
- set `self.score`
- set `self.success`

You can also optionally set `self.reason` in the measure methods (if you're using an LLM for evaluation), or wrap everything in a `try` block to catch any exceptions and set it to `self.error`. Here's a hypothetical example:

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CustomMetric(BaseMetric):
    ...

    def measure(self, test_case: LLMTestCase) -> float:
        # Although not required, we recommend catching errors
        # in a try block
        try:
            self.score = generate_hypothetical_score(test_case)
            if self.include_reason:
                self.reason = generate_hypothetical_reason(test_case)
            self.success = self.score >= self.threshold
            return self.score
        except Exception as e:
            # set metric error and re-raise it
            self.error = str(e)
            raise

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Although not required, we recommend catching errors
        # in a try block
        try:
            self.score = await async_generate_hypothetical_score(test_case)
            if self.include_reason:
                self.reason = await async_generate_hypothetical_reason(test_case)
            self.success = self.score >= self.threshold
            return self.score
        except Exception as e:
            # set metric error and re-raise it
            self.error = str(e)
            raise
```

:::tip

Often times, the blocking part of an LLM evaluation metric stems from the API calls made to your LLM provider (such as OpenAI's API endpoints), and so ultimately you'll have to ensure that LLM inference can indeed be made asynchronous.

If you've explored all your options and realize there is no asynchronous implementation of your LLM call (eg., if you're using an open-source model from Hugging Face's `transformers` library), simply **reuse the `measure` method in `a_measure()`**:

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CustomMetric(BaseMetric):
    ...

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)
```

You can also [click here to find an example of offloading LLM inference to a separate thread](/docs/metrics-introduction#mistral-7b-example) as a workaround, although it might not work for all use cases.
:::

### 4. Implement the `is_successful()` method

Under the hood, `deepeval` calls the `is_successful()` method to determine the status of your metric for a given `LLMTestCase`. We recommend copy and pasting the code below directly as your `is_successful()` implementation:

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CustomMetric(BaseMetric):
    ...

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        else:
            return self.success
```

### 5. Name Your Custom Metric

Probably the easiest step, all that's left is to name your custom metric:

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CustomMetric(BaseMetric):
    ...

    @property
    def __name__(self):
        return "My Custom Metric"
```

**Congratulations 🎉!** You've just learnt how to build a custom metric that is 100% integrated with `deepeval`'s ecosystem. In the following section, we'll go through a few real-life examples.

## Building a Custom Non-LLM Eval

An LLM-Eval is an LLM evaluation metric that is scored using an LLM, and so a non-LLM eval is simply a metric that is not scored using an LLM. In this example, we'll demonstrate how to use the [rouge score](https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation) instead:

```python
from deepeval.scorer import Scorer
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class RougeMetric(BaseMetric):
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.scorer = Scorer()

    def measure(self, test_case: LLMTestCase):
        self.score = self.scorer.rouge_score(
            prediction=test_case.actual_output,
            target=test_case.expected_output,
            score_type="rouge1"
        )
        self.success = self.score >= self.threshold
        return self.score

    # Async implementation of measure(). If async version for
    # scoring method does not exist, just reuse the measure method.
    async def a_measure(self, test_case: LLMTestCase):
        return self.measure(test_case)

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "Rouge Metric"
```

:::note
Although you're free to implement your own rouge scorer, you'll notice that while not documented, `deepeval` additionally offers a `scorer` module for more traditional NLP scoring method and can be found [here.](https://github.com/confident-ai/deepeval/blob/main/deepeval/scorer/scorer.py)

Be sure to run `pip install rouge-score` if `rouge-score` is not already installed in your environment.
:::

You can now run this custom metric as a standalone in a few lines of code:

```python
...

#####################
### Example Usage ###
#####################
test_case = LLMTestCase(input="...", actual_output="...", expected_output="...")
metric = RougeMetric()

metric.measure(test_case)
print(metric.is_successful())
```

## Building a Custom Composite Metric

In this example, we'll be combining two default `deepeval` metrics as our custom metric, hence why we're calling it a "composite" metric.

We'll be combining the `AnswerRelevancyMetric` and `FaithfulnessMetric`, since we rarely see a user that cares about one but not the other.

```python
from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

class FaithfulRelevancyMetric(BaseMetric):
    def __init__(
        self,
        threshold: float = 0.5,
        evaluation_model: Optional[str] = "gpt-4-turbo",
        include_reason: bool = True,
        async_mode: bool = True,
        strict_mode: bool = False,
    ):
        self.threshold = 1 if strict_mode else threshold
        self.evaluation_model = evaluation_model
        self.include_reason = include_reason
        self.async_mode = async_mode
        self.strict_mode = strict_mode

    def measure(self, test_case: LLMTestCase):
        try:
            relevancy_metric, faithfulness_metric = initialize_metrics()
            # Remember, deepeval's default metrics follow the same pattern as your custom metric!
            relevancy_metric.measure(test_case)
            faithfulness_metric.measure(test_case)

            # Custom logic to set score, reason, and success
            set_score_reason_success(relevancy_metric, faithfulness_metric)
            return self.score
        except Exception as e:
            # Set and re-raise error
            self.error = str(e)
            raise

    async def a_measure(self, test_case: LLMTestCase):
        try:
            relevancy_metric, faithfulness_metric = initialize_metrics()
            # Here, we use the a_measure() method instead so both metrics can run concurrently
            await relevancy_metric.a_measure(test_case)
            await faithfulness_metric.a_measure(test_case)

            # Custom logic to set score, reason, and success
            set_score_reason_success(relevancy_metric, faithfulness_metric)
            return self.score
        except Exception as e:
            # Set and re-raise error
            self.error = str(e)
            raise

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        else:
            return self.success

    @property
    def __name__(self):
        return "Composite Relevancy Faithfulness Metric"


    ######################
    ### Helper methods ###
    ######################
    def initialize_metrics(self):
        relevancy_metric = AnswerRelevancyMetric(
            threshold=self.threshold,
            model=self.evaluation_model,
            include_reason=self.include_reason,
            async_mode=self.async_mode,
            strict_mode=self.strict_mode
        )
        faithfulness_metric = FaithfulnessMetric(
            threshold=self.threshold,
            model=self.evaluation_model,
            include_reason=self.include_reason,
            async_mode=self.async_mode,
            strict_mode=self.strict_mode
        )
        return relevancy_metric, faithfulness_metric

    def set_score_reason_success(
        self,
        relevancy_metric: BaseMetric,
        faithfulness_metric: BaseMetric
    ):
        # Get scores and reasons for both
        relevancy_score = relevancy_metric.score
        relevancy_reason = relevancy_metric.reason
        faithfulness_score = faithfulness_metric.score
        faithfulness_reason = faithfulness_reason.reason

        # Custom logic to set score
        composite_score = min(relevancy_score, faithfulness_score)
        self.score = 0 if self.strict_mode and composite_score < self.threshold else composite_score

        # Custom logic to set reason
        if include_reason:
            self.reason = relevancy_reason + "\n" + faithfulness_reason

        # Custom logic to set success
        self.success = self.score >= self.threshold
```

Now go ahead and try to use it:

```python title="test_llm.py"
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
...

def test_llm():
    metric = FaithfulRelevancyMetric()
    test_case = LLMTestCase(...)
    assert_test(test_case, [metric])
```

```bash
deepeval test run test_llm.py
```
